The data for this project was provided by Starbucks for the Udacity Data Scientist Nanodegree program. It is simulated (rather than actual) data that mimics reward offers and customer purchasing behavior over a 30-day test period.
The main question I would like to answer with this project is:
In order to answer this question, I plan to create a machine learning model that predicts transaction amounts based on demographic data and reward type. Using this model, I can identify which offer maximizes a member's predicted transaction and then recommend making that offer to them.
Three datasets are provided:
- portfolio.json: Offers sent during a 30-day test period
- profile.json: Rewards program users
- transcript.json: Event log
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.legend_handler import HandlerTuple
from datetime import datetime
import progressbar
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import VotingRegressor
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# read in datasets
portfolio = pd.read_json('data/portfolio.json',orient='records',lines=True)
profile = pd.read_json('data/profile.json',orient='records',lines=True)
transcript = pd.read_json('data/transcript.json',orient='records',lines=True)
# Preview each dataframe and display the shape
display(portfolio.head())
print('portfolio shape:',portfolio.shape)
display(profile.head())
print('profile shape:',profile.shape)
display(transcript.head())
print('transcript shape:',transcript.shape)
The portfolio dataframe is simple to explore: with only 10 rows and 6 columns, the entire dataframe can be displayed at once.
# display the entire portfolio dataframe
portfolio
There are 10 different offers: four BOGO (buy one, get one), four discount, and two informational types. All of them use email as a channel, which means that including email doesn't help distinguish between any of the offer types. I can drop this value from the 'channels' feature to simplify the analysis.
For the profile dataframe, let's first look at null values. Missing values in 'age' are represented by the value 118, while in 'gender' they are represented by None. I will map these values to NaNs and then count the null values in each column.
# map 118 to NaN in the 'age' column
profile['age'] = profile['age'].apply(lambda x: np.nan if x==118 else x)
profile.head()
# map None to NaN in the 'gender' column
profile['gender'] = profile['gender'].apply(lambda x: np.nan if x is None else x)
profile.head()
# count the number of null values in each column of the profile dataframe
profile.isnull().sum()
# now count the number of rows with 3 null values
profile.isnull().sum(axis=1).value_counts()
Of the 17,000 users in the profile dataset, 2,175 are missing age, gender, and income data while the remaining 14,825 users are not missing any demographic data. The only identifying feature for the 2,175 users is the date they joined the Starbucks rewards program. This could be a useful feature later in the analysis, but if it isn't, then I'll feel confident dropping the users with missing values since I'll still have a majority of the dataset with complete information.
I also want to visualize the demographic data and calculate some descriptive statistics for each demographic feature.
# plot a histogram of the 'age' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(profile['age'],bins=range(10,120,10))
ax.set_xlabel('Age (years)',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Starbucks Rewards Member Ages',fontsize=16)
plt.show()
# plot a histogram of the 'became_member_on' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(profile['became_member_on'],bins=range(20130000,20200000,10000))
ax.set_xticks(range(20130000,20200000,10000))
ax.set_xlabel('Enrollment year',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Starbucks Rewards Member Enrollment',fontsize=16)
plt.show()
# plot a pie chart of the 'gender' feature
gender_labels = profile['gender'].value_counts().index.map({'M':'Male','F':'Female','O':'Other'})
fig, ax = plt.subplots(figsize=(16,9))
ax.pie(profile['gender'].value_counts(),labels=gender_labels,autopct='%1.f%%')
ax.set_title('Breakdown of Starbucks Rewards Member Genders',fontsize=16)
plt.show()
# plot a histogram of the 'income' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(profile['income'],bins=range(0,140000,20000))
ax.set_xlabel('Income',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Starbucks Rewards Member Income',fontsize=16)
plt.show()
# Generate descriptive statistics of the demographic data
profile.describe()
Visualizations and descriptive statistics of the demographic data in the profile dataset have given me a better understanding of the makeup of this user group. The rewards members range in age from 18 to 101 years with a mean age of ~54 years. The minimum age makes sense since an individual likely must be 18 years or older to join the rewards program, but I am a little surprised that there is a 101-year-old member and that the age distribution is centered around 55; I expected the distribution to be centered around users in their 20s and 30s. Member incomes range from \$30,000 to \$120,000 with a mean of ~\$65,400, which falls in line with my expectations for income distribution in the U.S. A majority (57%) of members are male, which also surprised me since I expected a split closer to 50/50 or skewed towards female consumers. Finally, membership enrollment ranges from July 2013 to July 2018, with enrollment increasing at an increasing rate from 2013 through 2017 before decreasing in 2018. The decrease in 2018 is likely due to capturing an incomplete year of member enrollments. This trend indicates that the Starbucks Rewards program is growing in popularity, and therefore there is an opportunity to influence the purchasing behavior of a growing number of Starbucks customers with this project.
For the transcript dataset, I don't expect there to be any missing data since these records were generated during the experiment and should have been set up to record all relevant information.
transcript.isnull().sum()
As expected, there are no missing values in the transcript dataset. To better understand the data in this dataset, I will have to use different exploration techniques than those used on the profile dataset since most of those features were continuous numerical or categorical types.
# plot a pie chart of the 'event' feature
fig, ax = plt.subplots(figsize=(16,9))
ax.pie(transcript['event'].value_counts(),labels=transcript['event'].value_counts().index,autopct='%1.f%%')
ax.set_title('Breakdown of Event Types')
plt.show()
# determine the number of unique members in the transcript dataframe
n_members = len(np.unique(transcript['person']))
print('{} members had an event captured in the transcript dataset'.format(n_members))
# group the transcript dataframe by 'person'
transcript_grouped = transcript.groupby('person')
# plot distribution of events per person
fig, ax = plt.subplots(figsize=(16,9))
ax.hist(transcript_grouped['event'].count(),bins=range(0,55,5))
ax.set_xlabel('Events per person',fontsize=12)
ax.set_ylabel('Count',fontsize=12)
ax.set_title('Distribution of Events per Person',fontsize=16)
plt.show()
# generate descriptive statistics on the events per person
transcript_grouped['event'].count().describe()
# confirm that all of the customers in the 'transcript' dataframe are also in the 'profile' dataframe
print('Number of users that are in the transcript dataset but not the profile dataset:',len(np.setdiff1d(np.unique(transcript['person']),profile['id'])))
# plot distribution of events over time
fig, ax = plt.subplots(figsize=(16,9))
ax.scatter(transcript.index,transcript['time'])
ax.set_xlabel('Event Number',fontsize=12)
ax.set_ylabel('Time since start of experiment (hour)',fontsize=12)
ax.set_title('Distribution of Events over Time',fontsize=16)
plt.show()
The transcript dataset is broken up fairly evenly between transactions (45%) and reward-based events consisting of offers received (25%), viewed (19%), and completed (11%). Based on these percentages, roughly three-quarters (19%/25%) of the offers that were received were also viewed, and less than half (11%/25%) were completed. Each user had at least one event, and all of the users recorded in the transcript dataset are also in the profile dataset. Finally, the distribution of events over time shows a large number of events occurring in a single hour periodically throughout the 30-day experimental period. These periods of activity probably correspond with the distribution of new reward offers, which can happen simultaneously if they are generated automatically. The way these reward offers are distributed in groups may serve to break the experiment into multiple segments. This raises the question: are new offers only sent out after the previous offers expired, or is there overlap between offers for a given customer? I will investigate this.
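The view and completion rates above follow directly from the event counts; a quick sketch using the rounded percentages as stand-in counts (the real numbers would come from `transcript['event'].value_counts()`):

```python
import pandas as pd

# stand-in counts based on the rounded percentages above;
# the real values come from transcript['event'].value_counts()
counts = pd.Series({'transaction': 45, 'offer received': 25,
                    'offer viewed': 19, 'offer completed': 11})
view_rate = counts['offer viewed'] / counts['offer received']
completion_rate = counts['offer completed'] / counts['offer received']
print(f'{view_rate:.0%} viewed, {completion_rate:.0%} completed')
```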
Now that I better understand the data that I am working with, I will clean it to make it easier to work with. This includes one hot encoding categorical features, extracting data from lists or dictionaries within the datasets, converting strings to dates and times, and remapping reward and member ids to something that is easier to comprehend.
# view the portfolio dataset
portfolio
# one hot encode the 'offer_type' feature of the 'portfolio' dataset
portfolio = pd.get_dummies(portfolio,prefix='',prefix_sep='',columns=['offer_type'])
portfolio
To one hot encode the 'channels' feature, I can't use get_dummies because there are multiple entries in each row. I will have to iterate through each row and manually assign values instead.
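As an aside, scikit-learn's MultiLabelBinarizer offers a vectorized alternative to a row-by-row loop; a sketch on a toy frame, assuming each cell holds a list of channel strings as in the portfolio data:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# toy frame standing in for the portfolio 'channels' column
df = pd.DataFrame({'channels': [['web', 'email'], ['email', 'mobile', 'social']]})

# fit_transform produces one 0/1 column per unique label
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(df['channels']),
                       columns=mlb.classes_, index=df.index)
df = df.drop(columns='channels').join(encoded)
print(df)
```

I'll stick with the explicit loop below since the portfolio dataframe is tiny and the loop is easy to reuse on other columns, but the vectorized version matters on larger frames.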
# create a list of all options for 'channels' (easy to do visually/manually in a small df)
channel_list = ['web','email','mobile','social']
# write a function to encode the channels in 'channels' into separate columns
def encode_column(df,column,col_list):
'''
One hot encodes a column of a dataframe when multiple entries occur in a single column
Drops the original column once encoding is done
Inputs
df (pandas dataframe): the dataframe on which to perform the encoding
column (string): name of the column to encode
col_list (list): all values which can be in 'column'
Returns
df (pandas dataframe): the original df with encoded columns and without the original column
'''
for col in col_list: # loop through the list of values in column
df.loc[:,col] = 0 # initialize each new column with 0's
# create progress bar to track progress
bar = progressbar.ProgressBar(maxval=len(df.index),widgets=[progressbar.Bar('=','[',']',' '),' ',progressbar.Percentage()])
bar.start()
cter = 1
for idx in df.index: # loop through the indices of the df
for value in df.loc[idx,column]:
df.loc[idx,value] = 1 # assign a value of 1 to the corresponding index,column that corresponds to 'value'
bar.update(cter) # update progress bar
cter += 1 # update counter
bar.finish()
# drop the original column
df = df.drop(columns=column)
return df
# run the 'encode_column' function on the 'channels' column in 'portfolio' df and assign to 'portfolio_encoded'
portfolio_encoded = encode_column(portfolio,'channels',channel_list)
# view new 'portfolio_encoded' df
portfolio_encoded
# since every reward can be delivered via email, this does not add any information to the analysis
# I will drop the 'email' column
portfolio_encoded.drop(columns='email',inplace=True)
The final bit of cleaning I will perform on the portfolio dataset is more cosmetic than functional. I will replace the 32 character alphanumeric 'id' with an integer value and group them by reward type. This will make it easier for me to quickly refer to the reward ids later in the analysis.
# sort the 'portfolio_encoded' df so that it is grouped by reward type, then increasing difficulty and duration
portfolio_encoded.sort_values(by=['bogo','discount','informational','difficulty','duration'],
ascending=[False,False,False,True,True],inplace=True)
portfolio_encoded
def id_mapper(df,column):
'''
Create a dictionary with keys corresponding to original ids and values corresponding to new integer ids
Inputs
df (pandas dataframe): the dataframe containing ids to remap
column (string): the name of the id column
Returns
id_dict (dict): dictionary with original ids as keys and new integer ids as values
'''
id_dict = {}
counter = 1 # initialize counter for reward_ids
for _id in df[column]:
if _id not in id_dict.keys():
id_dict[_id] = counter
counter+=1
return id_dict
# create the dictionary for reward ids
reward_id_dict = id_mapper(portfolio_encoded,'id')
# view the reward_id dictionary
reward_id_dict
def map_id_column(df,old_col,new_col,id_dict):
'''
Creates a new column in a dataframe with values based on the values in an existing column and a mapping dictionary
Drops the existing column
Inputs
df (pandas dataframe): the dataframe containing the column to map
old_col (string): the name of the existing column
new_col (string): the name of the new column to create
id_dict (dict): dictionary with old_col entries as keys and new_col entries as values
Returns
df (pandas dataframe): the original df with the new column added and the old column dropped
'''
df[new_col] = df[old_col].apply(lambda x: id_dict.get(x,0)) # create new column with values mapped from the old column
df.drop(columns=old_col,inplace=True) # drop the old column
return df
portfolio_encoded = map_id_column(portfolio_encoded,'id','reward_id',reward_id_dict)
portfolio_encoded
# save portfolio_encoded in pickle file
pickle.dump(portfolio_encoded,open('pickle_files/portfolio_encoded.p','wb'))
# load portfolio_encoded from pickle file
portfolio_encoded = pickle.load(open('pickle_files/portfolio_encoded.p','rb'))
# create an inverse dictionary mapper in case I want to access the original ids
def id_mapper_inverse(id_dict):
'''
Creates an inverse dictionary where the keys become the values and the values become the keys
Input
id_dict (dict): the original dictionary to invert
Returns
id_dict_inv (dict): the inverted dictionary
'''
id_dict_inv = {} # initialize dictionary
for key, value in id_dict.items(): # loop through the key-value pairs
id_dict_inv[value] = key # assign the value as the key and the key as the value of the inverse dictionary
return id_dict_inv
# test the id_mapper_inv function on the reward_id_dict
reward_id_dict_inv = id_mapper_inverse(reward_id_dict)
reward_id_dict_inv
The portfolio dataset now consists of only numeric values, and the reward_ids are grouped in a way that I can remember and quickly refer to later.
I have already handled the missing values in the profile dataset. Now I will convert the enrollment date ('became_member_on') string to a date object and a timestamp and map the alphanumeric 'id' to a numeric user_id.
# convert 'became_member_on' string to date object
profile['enrollment_date'] = profile['became_member_on'].apply(lambda x: datetime.strptime(str(x),'%Y%m%d').date())
profile['enrollment_tstamp'] = profile['became_member_on'].apply(lambda x: datetime.strptime(str(x),'%Y%m%d').timestamp())
profile.head()
# one hot encode 'gender' column
profile_encoded = pd.get_dummies(profile,columns=['gender'])
# drop the 'became_member_on' column
profile_encoded.drop(columns='became_member_on',inplace=True)
# preview df
profile_encoded.head()
# sort the profile dataset by enrollment date then use the id_mapper function
member_id_dict = id_mapper(profile_encoded.sort_values(by=['enrollment_date','age']),'id')
profile_encoded = map_id_column(profile_encoded,'id','member_id',member_id_dict)
profile_encoded.head()
# create an inverse member_id dictionary
member_id_dict_inv = id_mapper_inverse(member_id_dict)
For the transcript dataset, I need to one hot encode the 'event' column, map the 'person' column to the numeric 'member_id' using the member_id dictionary, and extract the values from the dictionaries in the 'value' column.
# replace spaces with underscores in 'event' column
transcript['event'] = transcript['event'].apply(lambda x: x.replace(' ','_'))
# one hot encode the 'event' column
transcript_encoded = pd.get_dummies(transcript,prefix='',prefix_sep='',columns=['event'])
# preview the new df
transcript_encoded.head()
# map the alphanumeric 'person' column to 'member_id'
transcript_encoded = map_id_column(transcript_encoded,'person','member_id',member_id_dict)
transcript_encoded.head()
# identify all possible options for keys in the 'value' column
value_set = set() # initialize an empty set
for row in transcript_encoded['value']: # loop through the 'value' column
for key in row.keys():
value_set.add(key) # add the key from each dictionary to the set
# view the set
value_set
# create new columns for the keys in the 'value' column
for col in list(value_set):
transcript_encoded[col] = 0 # initialize them with 0's
transcript_encoded.head()
# # loop through the indices and assign the values in the 'value' dictionaries to the columns matching the keys
# # use a progress bar to track progress (takes a while)
# bar = progressbar.ProgressBar(maxval=transcript_encoded.index[-1],widgets=[progressbar.Bar('=','[',']',' '),' ',progressbar.Percentage()])
# bar.start()
# for idx in transcript_encoded.index:
# value_dict = transcript_encoded.loc[idx,'value']
# for key, value in value_dict.items():
# transcript_encoded.loc[idx,key] = value
# bar.update(idx)
# bar.finish()
# # save the dataframe in a pickle file to skip the previous step in future runs
# pickle.dump(transcript_encoded,open('pickle_files/transcript_encoded.p','wb'))
# load the pickle file
transcript_encoded = pickle.load(open('pickle_files/transcript_encoded.p','rb'))
# preview the dataframe loaded from the pickle file
transcript_encoded.head()
# count the number of rows where each of the 'value' features has been populated
(transcript_encoded.iloc[:,-4:] != 0).sum()
# check if there are any rows where both 'offer id' and 'offer_id' are populated
print('The number of rows where both "offer id" and "offer_id" are populated:',
((transcript_encoded.loc[:,'offer id'] !=0) & (transcript_encoded.loc[:,'offer_id'] !=0)).sum())
# since there are no rows where both 'offer id' and 'offer_id' are populated, I can combine them into one column
transcript_encoded['offer_id'] = transcript_encoded[['offer id','offer_id']].apply(lambda x: x['offer id'] if x['offer id'] != 0 else x['offer_id'],axis=1)
# drop the 'offer id' column and 'value' columns
transcript_encoded.drop(columns=['value','offer id'],inplace=True)
# preview the dataframe
transcript_encoded.head()
# map the offer_ids to reward_ids
transcript_encoded = map_id_column(transcript_encoded, 'offer_id','reward_id',reward_id_dict)
# preview the dataframe
transcript_encoded.head()
Now that the transcript dataset has been cleaned and only numeric values remain, I will transform the dataset to group transactions by member_id and determine whether each transaction occurred while an offer was active. There are three criteria to determine if an offer is active at a given time: 1) the offer was viewed, 2) the offer has not expired, and 3) the offer has not been completed (applies to BOGO offers).
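These criteria boil down to a simple interval check; a minimal sketch (a hypothetical helper for illustration, not part of the pipeline below) with times in hours:

```python
def offer_is_active(t, time_viewed, time_expired, time_completed=None):
    """An offer is active at time t if it has been viewed (t >= time_viewed),
    has not expired (t <= time_expired), and has not yet been completed."""
    end = time_completed if time_completed is not None else time_expired
    return time_viewed <= t <= end

# viewed at hour 96, expires at hour 168: active at hour 100
print(offer_is_active(100, time_viewed=96, time_expired=168))
# same offer, but completed at hour 98: no longer active at hour 100
print(offer_is_active(100, time_viewed=96, time_expired=168, time_completed=98))
```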
def get_member_transcript(member_id, df=transcript_encoded):
'''
Get all actions for a given member
Input
member_id (int): the numeric member_id for the user whose transcript you want to view
df (dataframe): the dataframe from which to extract the appropriate rows
Returns
transcript (dataframe): All actions for that member in the dataset, sorted by time
'''
transcript = df.loc[df['member_id']==member_id,:]
transcript = transcript.sort_values(by='time')
return transcript
# test get_member_transcript function
get_member_transcript(7566)
def get_member_transactions(member_id, df=transcript_encoded):
'''
Get all transactions for a given member
Input
member_id (int): the numeric member_id for the user whose transactions you want to view
df (dataframe): the dataframe from which to extract the appropriate rows
Returns
transactions (dataframe): All transactions for that member in the dataset, sorted by time
'''
transactions = df.loc[(df['member_id']==member_id)&(df['transaction']==1),:]
transactions = transactions.sort_values(by='time')
return transactions
# test get_member_transactions function
get_member_transactions(7566)
def create_active_offer_dict(df):
'''
Creates a dictionary with member_id's as keys and, as values, lists of tuples containing the following:
- time offer was viewed
- time offer expired
- reward_id
- time offer was completed (optional)
Input
df (pandas dataframe): should contain time, action (offer received, viewed, completed, transaction), and reward_id
Returns
active_offer_dict: the dictionary described above
'''
# merge df with the portfolio_encoded df to add 'duration' column
df = df.merge(portfolio_encoded,how='left',on='reward_id',suffixes=('_trans','_offer'))
# initialize dictionary
active_offer_dict = {}
for member in np.unique(df['member_id']): # loop through all member ids in df
# create dataframe with all events for one member
member_trans_df = get_member_transcript(member, df)
# create dataframe with only rows relating to offers received or viewed
offer_rec_view_df = member_trans_df.loc[(member_trans_df['offer_received']==1) | (member_trans_df['offer_viewed'] == 1),:]
if offer_rec_view_df.empty:
continue # proceed to the next member if there are no offers received
# create dataframe with only rows relating to offers completed
completions_df = member_trans_df.loc[member_trans_df['offer_completed']==1,:]
# shift 'time' values up one row in offer_rec_view_df to align the time an offer was viewed
# with the offer that was received. store in an array
time_viewed = offer_rec_view_df.groupby('reward_id')['time'].shift(-1)
# calculate the times that offers expire by adding the offer duration (converted from days to hours)
# to the time the offer was received. store in an array
time_expired = offer_rec_view_df.apply(lambda x: x['time']+x['duration']*24 if x['offer_received']==1 else np.nan,axis=1)
# create list of reward_ids for the offers that were received
reward_id_list = offer_rec_view_df['reward_id']
# combine the time_viewed, time_expired, and reward_id_list arrays into a df (to easily remove NaN's)
offer_window_df = pd.DataFrame({'time_viewed':time_viewed,'time_expired':time_expired,'reward_id':reward_id_list})
# drop missing values
offer_window_df.dropna(how='any',inplace=True)
# create list of tuples with format (time_viewed, time_expired, reward_id)
offer_window_list = list(zip(offer_window_df['time_viewed'],offer_window_df['time_expired'],offer_window_df['reward_id']))
# add the time an event was completed to the appropriate tuple in offer_window_list
for row in completions_df.index:
time = completions_df.loc[row,'time'] # extract the time the offer was completed
reward_id = completions_df.loc[row,'reward_id'] # extract the associated reward_id
for idx in range(len(offer_window_list)): # loop through all active offer tuples
start = offer_window_list[idx][0]
end = offer_window_list[idx][1]
reward = offer_window_list[idx][2]
if (time >= start) & (time <= end) & (reward_id == reward):
offer_window_list[idx] += (time,) # add the offer completion time to the matching active offer
# update the dictionary with the member_id as the new key and the list of active offer tuples as the value
active_offer_dict[member] = offer_window_list
return active_offer_dict
active_offer_dict = create_active_offer_dict(transcript_encoded)
def get_active_offers(transaction_time, member_id):
'''
Identify the active offer(s) (if any) for a given transaction
Input
transaction_time (int): the hour when the transaction occurred
member_id (int): the numeric member_id of the member who made the transaction
Returns
offers_active (list): list of the reward_ids of the active offers or 0 if no offers are active
'''
offers_active = [] # initialize empty list
offer_list = active_offer_dict.get(member_id,0) # get the list of offers for that member or return 0
if offer_list == 0: # if 0 is returned, member has no offers, so return [0]
return [0]
else:
for offer in offer_list: # loop through all offers for the given member
start = offer[0] # extract time_viewed as start of active window
if len(offer) == 4: # if offer was completed
end = offer[3] # extract time completed as end of active window
else:
end = offer[1] # or extract expiration time as end of active window
reward = offer[2] # extract reward_id
if (transaction_time >= start) & (transaction_time <= end):
# append reward to the offers_active list if the transaction time is in the active offer window
offers_active.append(reward)
if len(offers_active) == 0:
offers_active.append(0) # append 0 if there are no active offers (rather than return an empty list)
return offers_active
test_transactions = get_member_transactions(7566)
offers_active = []
for time, member in zip(test_transactions['time'],test_transactions['member_id']):
offers_active.append(get_active_offers(time,member))
offers_active
def insert_active_offers(df, col_num):
'''
Inserts the list of active offer lists into the df at the specified location with the column name 'active_offers'
Inputs
df (dataframe): the dataframe to insert active offers into, likely a df of transactions
col_num (int): the location at which to insert the column
Returns
df (dataframe): the input df with the inserted 'active_offers' column
'''
active_offers = [] # initialize list
for time, member in zip(df['time'],df['member_id']):
active_offers.append(get_active_offers(time,member))
df.insert(col_num,'active_offers',active_offers)
return df
# Create transactions df and run insert_active_offers function
transactions = transcript_encoded.loc[transcript_encoded['transaction']==1,:].copy() # copy to avoid SettingWithCopyWarning on insert
transactions = insert_active_offers(transactions,7)
# preview transactions df
transactions.head()
# # run encode_column function on 'active_offers' to one hot encode (takes a while)
# transactions_encoded = encode_column(transactions,'active_offers',range(11))
# # save to a pickle file to avoid step in future
# pickle.dump(transactions_encoded,open('pickle_files/transactions_encoded.p','wb'))
# # preview transactions_encoded df
# transactions_encoded.head()
# load the transactions_encoded df from the pickle file
transactions_encoded = pickle.load(open('pickle_files/transactions_encoded.p','rb'))
transactions_encoded.head()
# merge the profile dataset with the transactions_encoded dataset to add demographic data as features
learning_df = transactions_encoded.merge(profile_encoded,how='left',on='member_id',suffixes=('_trans','_prof'))
# preview new df
learning_df.head()
# drop columns that are unnecessary for machine learning
learning_df.drop(columns=['time','offer_completed','offer_received','offer_viewed','transaction','member_id',\
'reward','reward_id','enrollment_date'],inplace=True)
# fill missing values ('age','income') with the mean
learning_df = learning_df.fillna(learning_df.mean())
# save the df to a pickle file
pickle.dump(learning_df,open('pickle_files/learning_df.p','wb'))
# preview df
learning_df.head()
Now that I have cleaned and extracted features, it is time to use those features to model transaction behavior. The goal of the model is to predict the amount a member will spend in a transaction based on their demographic information (age, gender, income, length of membership) and the reward that is active (if any). Since the output variable ('amount') is continuous, I will use regression models and score them using mean squared error. Before training the models, I will have to split the dataset into training and testing sets and normalize the age, income, and enrollment_tstamp features to match the scale of the categorical features (0 to 1).
# load the cleaned and transformed dataframe to be used for modeling
learning_df = pickle.load(open('pickle_files/learning_df.p','rb'))
def create_train_test_split(df=learning_df,test_size=0.3):
'''
Splits a df into input (X) and output (y) arrays, then splits the arrays into training and testing sets
Note: since the transaction data is time-based, I will not shuffle the dataset; instead I use the older
transactions as training data and the more recent transactions as test data.
Inputs
df (pandas dataframe): the learning_df with 'amount' column as output feature
test_size (float): fraction of the dataset to use for testing
Returns
X_train, X_test, y_train, y_test (arrays): the split input and output datasets
'''
# Split the df into input (X) and output (y) arrays
X = df.drop(columns='amount')
y = df['amount']
# use sklearn's train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size,shuffle=False)
return X_train, X_test, y_train, y_test
# run create_train_test_split on the dataset
X_train, X_test, y_train, y_test = create_train_test_split()
scaler=MinMaxScaler() # instantiate a MinMaxScaler
X_train = scaler.fit_transform(X_train) # fit the scaler to the input training data
X_test = scaler.transform(X_test) # transform the input testing data based on the fit of the training data
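Since the split preserves time order, any cross-validation on the training set should too; scikit-learn's TimeSeriesSplit is one option (a sketch on synthetic data, not something the pipeline below relies on):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for time-ordered features
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # in every fold, all training rows precede all testing rows
    assert train_idx.max() < test_idx.min()
print('all folds respect time order')
```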
I want to quickly instantiate, fit, predict, and evaluate various regression models to determine which performs best on my dataset. To avoid repeating myself, I will write a couple of functions. The first, fit_test_model, fits a model, makes predictions, and compares them to the actual values for the training and testing data. The second, plot_pred_actual, calls fit_test_model and then uses its outputs to make a scatter plot of predicted vs. actual values for the training dataset. This lets me quickly fit a new model and evaluate not only the score (MSE or MAE) but also visually inspect the shape of the error to better understand how the model is performing.
def fit_test_model(model, X_train, X_test, y_train, y_test):
'''
Fits a model to training data, makes predictions on training and testing data,
calculates mean squared error and mean absolute error for training and testing data sets.
Prints the mean squared and mean absolute errors for the datasets.
Inputs:
model: an instantiated sklearn model to fit and test
X_train (array): normalized training dataset
X_test (array): normalized testing dataset (have same number of columns as X_train)
y_train (1D array): output values for the training dataset (must have same number of rows as X_train)
y_test (1D array): output values for the testing dataset (must have same number of rows as X_test)
Returns:
model: the input model fit to the training data
y_preds_train (1D array): the predictions for the training dataset (same shape as y_train)
y_preds_test (1D array): the predictions for the testing dataset (same shape as y_test)
'''
model.fit(X_train, y_train)
y_preds_train = model.predict(X_train)
y_preds_test = model.predict(X_test)
print('Training MSE:',mean_squared_error(y_train,y_preds_train))
print('Testing MSE:',mean_squared_error(y_test,y_preds_test))
print('Training MAE:',mean_absolute_error(y_train,y_preds_train))
print('Testing MAE:',mean_absolute_error(y_test,y_preds_test))
return model, y_preds_train, y_preds_test
def plot_pred_actual(model, X_train, X_test, y_train, y_test):
    '''
    Calls the fit_test_model function then displays a scatter plot of the predicted vs actual values for the training dataset
    Inputs:
        model: an instantiated sklearn model to fit and test
        X_train (array): normalized training dataset
        X_test (array): normalized testing dataset (have same number of columns as X_train)
        y_train (1D array): output values for the training dataset (must have same number of rows as X_train)
        y_test (1D array): output values for the testing dataset (must have same number of rows as X_test)
    Returns:
        None
    '''
    model, y_preds_train, y_preds_test = fit_test_model(model, X_train, X_test, y_train, y_test)
    fig, ax = plt.subplots(figsize=(16,9))
    plt.scatter(y_train, y_preds_train)
    plt.xlabel('Actual Transaction Amounts', fontsize=12)
    plt.ylabel('Predicted Transaction Amounts', fontsize=12)
    plt.title('Predicted vs Actual Transaction Amounts for Training Data', fontsize=16)
    plt.show()
# call the plot_pred_actual function on a default linear regression model
plot_pred_actual(LinearRegression(),X_train, X_test, y_train, y_test)
There are quite a few transaction amounts in the \$200 to \$1,000 range, which seem like extreme values compared to the rest of the dataset. I need to investigate how these values affect the overall dataset before handling them in some way. Intuitively, I don't believe there are enough input features to be able to predict transactions ranging from \$0 to \$1,000.
# create a boxplot to see if these values are considered outliers
fig, ax = plt.subplots(figsize=(16,6))
plt.boxplot(learning_df['amount'],vert=False)
plt.title('Boxplot of Transaction Amounts',fontsize=16)
plt.show()
# check the descriptive statistics for the 'amount' feature as well
learning_df['amount'].describe()
Despite the presence of transactions up to \$1,062, the descriptive statistics do not appear to be heavily influenced by these outliers. The mean transaction amount (\$12.78) is slightly higher than the median (\$8.89), indicating the data is (unsurprisingly) skewed right, but it is still closer to the median than to the 75th percentile. After researching how to handle outliers in output variables, I found a very useful Medium article discussing several methods: 1) using tree-based models, 2) performing a log-scale transformation, 3) dropping values based on the IQR method, and 4) using mean absolute error (MAE) instead of mean squared error (MSE), since MSE gives more weight to larger errors. I will evaluate all of these methods.
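One detail worth keeping in mind for the log-transformation method: a model fit on log(y) predicts on the log scale, so its errors are not directly comparable to raw-dollar errors until the predictions are exponentiated back. A minimal sketch on synthetic right-skewed data (the data and model here are illustrative, not this project's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = np.exp(X @ np.array([1.0, 2.0, 0.5]) + rng.normal(0, 0.1, size=200))  # right-skewed target

log_model = LinearRegression().fit(X, np.log(y))  # fit on the log scale
preds = np.exp(log_model.predict(X))              # invert the transform back to dollars
mae_dollars = mean_absolute_error(y, preds)       # now comparable to untransformed models
```

Without the `np.exp` inversion, an MAE computed on the log scale would look artificially small next to the raw-dollar scores of the other models.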
# call the plot_pred_actual function on a random forest regression model with 100 estimators
plot_pred_actual(RandomForestRegressor(n_estimators=100),X_train, X_test, y_train, y_test)
# call the plot_pred_actual function on a bagging regression model with 100 estimators
plot_pred_actual(BaggingRegressor(n_estimators=100),X_train, X_test, y_train, y_test)
# test a random forest regression model on log-transformed output with 100 estimators
plot_pred_actual(RandomForestRegressor(n_estimators=100),X_train, X_test, np.log(y_train), np.log(y_test))
# test a bagging regression model on log-transformed output with 100 estimators
plot_pred_actual(BaggingRegressor(n_estimators=100),X_train, X_test, np.log(y_train), np.log(y_test))
# use the IQR method to identify the upper limit above which values are considered outliers (Q3 + 1.5*IQR)
Q3 = np.percentile(learning_df['amount'],75)
Q1 = np.percentile(learning_df['amount'],25)
IQR = Q3-Q1
upper_limit = Q3 + 1.5*IQR
print('The upper limit using the IQR method is {:.2f}'.format(upper_limit))
# filter learning_df based on the upper limit
learning_df_IQR = learning_df.loc[learning_df['amount']<=upper_limit,:]
rows_dropped = learning_df.shape[0]-learning_df_IQR.shape[0]
print('{} out of {} rows ({:.2%}) were removed using the IQR method'\
.format(rows_dropped,learning_df.shape[0],rows_dropped/learning_df.shape[0]))
There are 1,236 transactions out of the 138,953 in the transactions dataset that are greater than \$41.00 and therefore considered outliers by the IQR method. This is an insignificant (less than 1\%) number of values, yet they have a large impact on the ability to predict transaction amounts with the available features. The tree-based ensemble methods (Random Forest and Bagging) struggled to handle these outliers: the scatter plots showed the models predicting transaction amounts on the order of hundreds of dollars for some transactions that were actually less than \$100, and vice versa, predicting low hundreds when the actual values were around \$1,000. Performing a log transformation on the output variable appeared to give better results, with the scatter closer to linear between predicted and actual values. However, the outliers are still visible and follow a different trend than the rest of the dataset.
Considering the small number of outliers and the large effect they have on the predictive ability of the models, I am comfortable simply dropping them from the dataset. The goal of the project is to recommend which reward to offer a given member, based on their demographic data, in order to maximize the amount of money they spend at Starbucks. The largest reward is a BOGO for \$10. If a member is making a purchase on the order of hundreds of dollars, I don't think a \$10 reward is going to affect whether they make that purchase or not. I would rather focus on the smaller transaction amounts (less than \$41) where a \$5 or \$10 reward could significantly impact purchasing behavior. I will try dropping the outliers and fitting the models to the slightly reduced dataset.
# recreate the training and testing datasets without the outliers
X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR = create_train_test_split(learning_df_IQR)
scaler_IQR = MinMaxScaler()
X_train_IQR = scaler_IQR.fit_transform(X_train_IQR)
X_test_IQR = scaler_IQR.transform(X_test_IQR)
# save MinMaxScaler in pickle file to be used in web app
pickle.dump(scaler_IQR,open('pickle_files/scaler.p','wb'))
# load MinMaxScaler from pickle file
scaler_IQR = pickle.load(open('pickle_files/scaler.p','rb'))
# test a default linear regression model
plot_pred_actual(LinearRegression(), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# test a Random Forest regressor with 100 estimators
plot_pred_actual(RandomForestRegressor(n_estimators=100), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# increase the number of features using PolynomialFeatures
# then test a linear regression model on the increased number of features
poly_model2 = Pipeline([('poly', PolynomialFeatures(degree=2)),
('linear', LinearRegression(fit_intercept=False))])
plot_pred_actual(poly_model2, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# try creating 3rd degree polynomial features with a linear model
poly_model3 = Pipeline([('poly', PolynomialFeatures(degree=3)),
('linear', LinearRegression(fit_intercept=False))])
plot_pred_actual(poly_model3, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# try 2nd degree polynomial features with a random forest regressor
poly_model2 = Pipeline([('poly', PolynomialFeatures(degree=2)),
('forest', RandomForestRegressor(n_estimators=100))])
plot_pred_actual(poly_model2, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# try a bagging regressor with 10 estimators
plot_pred_actual(BaggingRegressor(n_estimators=10), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# try weighting the outputs of a linear, random forest, and bagging model
reg1 = LinearRegression()
reg2 = RandomForestRegressor(n_estimators=100)
reg3 = BaggingRegressor(n_estimators=100, max_samples=0.4)
voter = VotingRegressor([('lr', reg1), ('rf', reg2), ('bg', reg3)], weights=[.2, .4, .4])
plot_pred_actual(voter, X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# try a linear Support Vector regression model
plot_pred_actual(LinearSVR(), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# try a default Stochastic Gradient Descent regressor
plot_pred_actual(SGDRegressor(), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
The default tree-based (Bagging and Random Forest) regression models appear to have the best mean absolute error of the various methods I have tested, for both the training (less than 3) and testing (less than 4) data sets. The linear models (Linear Regression, Linear Regression with Polynomial Features, Linear SVR, and SGD) all had mean absolute errors greater than 4 for both the training and testing data sets. Furthermore, the scatter plots of predicted vs. actual values for the training data of the tree-based models show relationships that are closest to linear. They tend to underpredict the larger amounts (>\$30), but there is a much tighter clustering for the low amounts. Again, I find the ability to predict the smaller transaction amounts to be much more important than the larger amounts, since a reward could have more effect on purchasing decisions for smaller transactions. After reading some posts about the difference between Bagging and Random Forests, I learned that Bagging uses all of the features to determine the most effective split in the data, whereas Random Forests use a random subset of the features. There are only 17 features on which to train a model, and 14 of those are mutually exclusive (11 reward offers, including the 'no reward' option, and the 3 gender options), meaning there are effectively only 5 features from which to make predictions for any given data point. With such a small number of features, I do not want to limit how many are used to train any one estimator, so I will focus on tuning a Bagging regressor rather than a Random Forest regressor for my final model.
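The feature-use distinction described above can be made concrete with BaggingRegressor's max_features parameter: by default every estimator sees all of the columns, while lowering max_features emulates the Random Forest style of per-estimator feature subsampling. A small sketch on synthetic data (purely illustrative, not this project's dataset):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(42)
X, y = rng.normal(size=(300, 17)), rng.normal(size=300)  # 17 features, like the project's dataset

bag_all = BaggingRegressor(n_estimators=20, random_state=0).fit(X, y)        # max_features=1.0 default
bag_sub = BaggingRegressor(n_estimators=20, max_features=0.5,
                           random_state=0).fit(X, y)                         # ~half the features each

print(bag_all.estimators_features_[0].size)  # every estimator is given all 17 features
print(bag_sub.estimators_features_[0].size)  # each estimator sees only a random subset
```

With so few effective features, keeping max_features at its default of 1.0 matches the reasoning above for preferring Bagging here.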
There are two main hyperparameters to tune for Bagging regression: the number of estimators (n_estimators) and the fraction of samples drawn for each estimator (max_samples). I will perform a grid search over these two hyperparameters using mean absolute error as the scoring metric.
# Run a grid search on Bagging Regressor hyperparameters
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
bg = BaggingRegressor()
parameters = {'n_estimators':[10,100,200],'max_samples':[0.25,0.5,0.75,1.0]}
model = GridSearchCV(bg,parameters,scoring=scorer,cv=5)
model.fit(X_train_IQR, y_train_IQR)
# view the parameters of the best estimator
model.best_estimator_
# run the best_estimator model
plot_pred_actual(BaggingRegressor(n_estimators=200,max_samples=0.5), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# store the fitted model with the best parameters (200 estimators, 0.5 max_samples)
model, y_preds_train, y_preds_test = fit_test_model(BaggingRegressor(n_estimators=200,max_samples=.5), X_train_IQR, X_test_IQR, y_train_IQR, y_test_IQR)
# view the scatter in the predicted vs actual values for the test set
fig, ax = plt.subplots(figsize=(16,9))
plt.scatter(y_test_IQR,y_preds_test)
plt.xlabel('Actual Transaction Amounts',fontsize=12)
plt.ylabel('Predicted Transaction Amounts',fontsize=12)
plt.title('Predicted vs Actual Transaction Amounts for Testing Data',fontsize=16)
plt.show()
# save the model in a pickle file for future access
pickle.dump(model,open('pickle_files/model.p','wb'))
Tuning the Bagging regressor with Grid Search resulted in a slightly worse mean absolute error for the training data set (2.81 for the tuned model compared to 2.58 for the default model) but a slightly better score for the testing data set (3.83 vs 3.97). The default model uses 10 estimators with a max_samples value of 1.0, meaning that all of the samples could be drawn for a given estimator. The best estimator, meanwhile, had 200 estimators with a max_samples value of 0.5. The fact that the testing score improved despite the training score getting worse suggests that the original model was slightly overfit on the training data. Increasing the number of estimators while decreasing the maximum number of samples that could be drawn for each estimator helped to reduce the variance of the model and therefore improve the model's predictive ability.
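As a side note, GridSearchCV exposes the cross-validated score for every parameter combination (not just the winner) via its cv_results_ attribute, which makes trade-offs like the one described above easier to see. A self-contained sketch on synthetic data (the grid and data here are illustrative, not this project's):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, size=200)

scorer = make_scorer(mean_absolute_error, greater_is_better=False)
grid = GridSearchCV(BaggingRegressor(random_state=0),
                    {'n_estimators': [10, 50], 'max_samples': [0.5, 1.0]},
                    scoring=scorer, cv=3).fit(X, y)

# one row per parameter combination; scores are negated MAE (higher is better)
summary = pd.DataFrame(grid.cv_results_)[
    ['param_n_estimators', 'param_max_samples', 'mean_test_score', 'rank_test_score']]
```

Sorting summary by rank_test_score shows how the score changes as n_estimators grows and max_samples shrinks, rather than only reporting best_estimator_.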
Now that I have a model, I need to be able to make predictions for an individual user based on which reward is offered (if any). To do this I will write a function that takes a fitted model and the user's demographic data, loops through all of the reward options, and makes and stores a prediction for each option so that I can identify which reward yields the largest predicted transaction amount for that user.
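As a design note, the per-option loop described above could also be batched: tile the member's feature row 11 times, set a different reward indicator in each row, and ask the model for all 11 predictions in one call. A minimal sketch on synthetic data (the LinearRegression stand-in and random features are placeholders, not the project's model or inputs):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
model = LinearRegression().fit(rng.normal(size=(50, 17)), rng.normal(size=50))  # stand-in model

member = rng.normal(size=17)       # one member's scaled 17-feature row
batch = np.tile(member, (11, 1))   # 11 copies, one per reward option
batch[:, :11] = np.eye(11)         # row i activates only reward id i
preds = model.predict(batch)       # all 11 predictions in a single call
best_id = int(np.argmax(preds))    # reward with the largest predicted transaction
```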
# load the model from the pickle file
model = pickle.load(open('pickle_files/model.p','rb'))
# grab a single user to test
test_user = X_train_IQR[0].copy() # copy the row so the edits below don't modify X_train_IQR in place
test_user
# check the order of the columns in the training dataset
learning_df.columns
# the first column is the output variable, 'amount', followed by the reward_id columns 0-10.
# that means in the training dataset, the reward_ids are the first 11 columns
# set all the reward columns for the test user to 0
test_user[:11] = 0
test_user
# set the first column (reward_id=0) to 1
test_user[0] = 1
# make a prediction on the transaction amount for the test user with no reward offered
model.predict(test_user.reshape(1,len(test_user)))
# make a prediction with only 1 reward active at a time
test_user_preds = np.array([]) # initialize an empty array
test_user[:11] = 0 # set the reward booleans to 0
for idx in range(11):  # loop through each reward id
    test_user[idx] = 1  # set the reward id to 1 (active)
    test_user_preds = np.append(test_user_preds, model.predict(test_user.reshape(1, 17)))  # make prediction
    test_user[idx] = 0  # reset reward id to 0 (inactive) for next iteration
# extract the reward values from the portfolio_encoded df
reward = portfolio_encoded['reward'].values
reward = np.insert(reward,0,0) # insert a reward of 0 for reward id 0 (no reward)
reward # preview
# extract the difficulty values from the portfolio_encoded df
difficulty = portfolio_encoded['difficulty'].values
difficulty = np.insert(difficulty, 0, 0) # insert a difficulty of 0 for reward id 0 (no reward)
difficulty # preview
# compare predicted transaction amounts to difficulties to see if reward was met
reward_met = np.greater_equal(test_user_preds,difficulty)
reward_met
# subtract the reward value from the transaction amounts of the reward types that were met
reward_adjusted_preds = np.subtract(test_user_preds, reward,out=test_user_preds,where=reward_met)
reward_adjusted_preds # preview
# plot the adjusted predicted transactions for each reward
fig, ax = plt.subplots(figsize=(16,9))
plt.bar(range(11),reward_adjusted_preds)
plt.xlabel('Reward ID',fontsize=12)
plt.ylabel('Predicted transaction amount (USD)',fontsize=12)
plt.title('Predicted transaction amount vs Reward ID',fontsize=16)
plt.show()
def make_member_predictions(model, member_inputs):
    '''
    Makes predictions for the amount a member will spend based on each of the 10 reward offers or no reward offer
    Deducts the reward amount if the predicted amount is greater than the difficulty to complete the reward
    Inputs:
        model: an sklearn model fit to training data
        member_inputs (2D array): normalized values for the input features with shape (# members, 17 features)
            the features (columns) must be in this order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 'age', 'income',
            'enrollment_tstamp', 'gender_F', 'gender_M', 'gender_O'
    Returns:
        reward_adjusted_preds (2D array): the reward-adjusted predicted transaction amounts for the 11 reward
            options (no reward + 10 reward types) - shape (# members, 11 columns)
    '''
    inputs = np.copy(member_inputs)  # create a copy to avoid changing the original inputs
    member_preds = np.zeros((inputs.shape[0], 11))  # initialize array to store member predictions
    for member_num in range(inputs.shape[0]):  # iterate through the inputs for each member
        inputs[member_num, :11] = 0  # reset the reward booleans to 0 so that only 1 reward is active at a time
        for idx in range(11):  # loop through each reward_id (0 through 10)
            inputs[member_num, idx] = 1  # set the reward_id boolean to 1 (True)
            # store the prediction in the member_preds array
            member_preds[member_num, idx] = model.predict(inputs[member_num, :].reshape(1, 17))
            inputs[member_num, idx] = 0  # set the reward_id back to 0
    reward_met = np.greater_equal(member_preds, difficulty)
    reward_adjusted_preds = np.subtract(member_preds, reward, out=member_preds, where=reward_met)
    return reward_adjusted_preds
member_preds = make_member_predictions(model,X_train_IQR[:2])
member_preds
def transform_demographic_data(age, income, enrollment_date, gender):
    '''
    Take raw demographic data (age, income, enrollment date, gender) and transform it:
    - convert date to timestamp, encode gender, scale all values
    - add 11 zeros in front of the data to create the input array for `make_member_predictions`
    Inputs:
        age (int or float): member age in years or NaN if age is unknown (if unknown, the mean age is used)
        income (float): member annual income in USD or NaN if unknown (if unknown, the mean income is used)
        enrollment_date (datetime object): the date that a member enrolled
        gender (string): 'F', 'M', 'O', or 'U' (for unknown)
    Returns:
        member_inputs (1x17 array): normalized array containing values for:
            reward_ids (0-10), age, income, enrollment date, and genders (F, M, O)
    '''
    mean_age = learning_df['age'].mean()  # mean age from the dataset (used to replace nulls)
    mean_income = learning_df['income'].mean()  # mean income from the dataset (used to replace nulls)
    if pd.isnull(age):
        age = mean_age  # if age is null, replace with mean age
    if pd.isnull(income):
        income = mean_income  # if income is null, replace with mean income
    tstamp = enrollment_date.timestamp()  # convert enrollment date from datetime object to timestamp (for scaling)
    # create the gender encoding array based on the gender provided (if any)
    if gender == 'F':
        gen_array = np.array([1, 0, 0])
    elif gender == 'M':
        gen_array = np.array([0, 1, 0])
    elif gender == 'O':
        gen_array = np.array([0, 0, 1])
    else:
        gen_array = np.array([0, 0, 0])
    member_inputs_raw = np.zeros((1, 17))  # create 1x17 array of zeros to initialize the member inputs
    member_inputs_raw[0, 11:14] = [age, income, tstamp]  # replace elements 11-13 with age, income and tstamp
    member_inputs_raw[0, 14:] = gen_array  # replace the last 3 elements with the gender encoding array
    member_inputs = scaler_IQR.transform(member_inputs_raw)  # scale the inputs using the MinMaxScaler
    return member_inputs.reshape(1, 17)  # return the inputs in a 1x17 array
# test transform_demographic_data
member_inputs = transform_demographic_data(27,75000,datetime.strptime('2014-06-15','%Y-%m-%d'),'M')
member_inputs
# test that transform_demographic_data can handle NaN's
member_inputs = transform_demographic_data(np.nan,np.nan,datetime.strptime('2014-06-15','%Y-%m-%d'),'M')
member_inputs
# test that make_member_predictions can handle output of transform_demographic_data
member_preds = make_member_predictions(model,member_inputs)
member_preds
def plot_reward_preds(preds):
    '''
    Creates a bar chart of predicted transaction amounts for all 11 reward options
    Input:
        preds (1x11 array): the transaction amounts for the reward options, typically the output of make_member_predictions
    Returns:
        None
    '''
    fig, ax = plt.subplots(figsize=(16,9))  # create a large figure
    plt.bar(range(11), preds.flatten())  # add a bar plot
    ax.set_xticks(range(11))
    plt.xlabel('Reward ID', fontsize=12)
    plt.ylabel('Predicted transaction amount (USD)', fontsize=12)
    plt.title('Predicted transaction amounts vs Reward ID', fontsize=16)
    plt.show()
# test the plot_reward_preds function
plot_reward_preds(member_preds)
# create a new function to identify the best reward ID
def best_reward(preds, show_plot=True):
    '''
    Determines which reward ID results in the maximum transaction amount
    Plots the predicted transaction amounts with the maximum amount in red if 'show_plot' is set to True
    Inputs:
        preds (1x11 array): the transaction amounts for the reward options, typically the output of make_member_predictions
        show_plot (boolean): if True, plots the predicted transaction amounts for each reward option with the best reward in red
    Returns:
        best_reward (int): the reward id corresponding to the maximum predicted transaction amount
    '''
    best_reward = np.argmax(preds.flatten())  # identify the index (aka reward_id) corresponding to the max trans. amt.
    if show_plot:  # make a bar plot if show_plot is True
        fig, ax = plt.subplots(figsize=(16,9))
        barlist = plt.bar(range(11), preds.flatten())
        barlist[best_reward].set_color('red')
        ax.set_xticks(range(11))
        plt.xlabel('Reward ID', fontsize=12)
        plt.ylabel('Predicted transaction amount (USD)', fontsize=12)
        plt.title('Predicted transaction amounts vs Reward ID', fontsize=16)
        plt.show()
    return best_reward
# test the best_reward function
best_reward_id = best_reward(member_preds)
print('The best reward is {}.'.format(best_reward_id))
Now that I have a tuned model and a function to identify the best reward option for an individual, I want to see if there are any patterns in the demographic data, i.e.:
To do that, I will grab the age, income, enrollment date, and gender for each member in the profile data set and use the transform_demographic_data, make_member_predictions, and best_reward functions to assign a recommended reward. After that, I can group the members by different demographics and see which reward is recommended for each group.
# create a dataframe with only the demographic data from the profile df
member_preds_df = profile[['age','income','became_member_on','gender']].copy() # copy to avoid a SettingWithCopyWarning when adding columns
# convert the 'became_member_on' column to a datetime object and store in 'enrollment_date'
member_preds_df['enrollment_date'] = member_preds_df['became_member_on'].apply(lambda x: datetime.strptime(str(x),'%Y%m%d'))
# drop the 'became_member_on' column
member_preds_df = member_preds_df.drop(columns='became_member_on')
# preview df
member_preds_df.head()
# # create a new column with the best reward for each member (takes a while)
# member_preds_df['best_reward'] = member_preds_df.apply(lambda x: best_reward(make_member_predictions(model,transform_demographic_data(x['age'],x['income'],x['enrollment_date'],x['gender'])),show_plot=False),axis=1)
# # preview df
# member_preds_df.head()
# # save the member_preds_df in a pickle file to skip the previous step in future runs
# pickle.dump(member_preds_df,open('pickle_files/member_preds_df.p','wb'))
# load pickle file
member_preds_df = pickle.load(open('pickle_files/member_preds_df.p','rb'))
# preview df
member_preds_df.head()
def plot_reward_freqs(counts, demographic):
    '''
    Determines which reward ID is recommended the most
    Plots the recommendation frequencies with the maximum frequency in red
    Inputs:
        counts (1x11 array): the counts for the reward options
        demographic (string): the demographic that the counts are grouped by; will be displayed in title
    Returns:
        None
    '''
    best_reward = np.argmax(counts.flatten())  # identify the index (aka reward_id) of the most frequent reward
    fig, ax = plt.subplots(figsize=(16,9))  # create a large figure
    barlist = plt.bar(range(11), counts.flatten()/sum(counts))  # add a bar plot of frequencies
    barlist[best_reward].set_color('red')  # highlight the most frequent reward in red
    ax.set_xticks(range(11))
    plt.xlabel('Reward ID', fontsize=12)
    plt.ylabel('Recommendation Frequency', fontsize=12)
    plt.title('Distribution of Reward Recommendations for {}'.format(demographic), fontsize=16)
    plt.show()
# make a bar chart of the best reward counts by reward id
plot_reward_freqs(member_preds_df['best_reward'].value_counts().sort_index().values,'All Members') # sort by reward id so bars line up with the x-axis
def plot_mult_reward_freqs(counts, demographic, legend):
    '''
    Determines which reward ID is recommended the most for two demographic groups
    Plots the recommendation frequencies with the maximum frequency in red/orange, respectively
    Inputs:
        counts (2x11 array): the counts for the reward options
        demographic (string): the demographic that the counts are grouped by; will be displayed in title
        legend (list of strings): 2-element list corresponding to the demographic distinction between the two groups, will be displayed in the legend
    Returns:
        None
    '''
    best_reward1 = np.argmax(counts[0].flatten())  # index (aka reward_id) of the most frequent reward in demo group 1
    best_reward2 = np.argmax(counts[1].flatten())  # index (aka reward_id) of the most frequent reward in demo group 2
    fig, ax = plt.subplots(figsize=(16,9))  # make a large figure
    width = 0.4  # set the bar width
    # plot the first demo group in blue
    barlist1 = plt.bar(np.arange(11)-width/2, counts[0].flatten()/sum(counts[0]), width=width, color='blue')
    barlist1[best_reward1].set_color('red')  # highlight the most frequent reward of demo group 1 in red
    # plot the second demo group in green
    barlist2 = plt.bar(np.arange(11)+width/2, counts[1].flatten()/sum(counts[1]), width=width, color='green')
    barlist2[best_reward2].set_color('orange')  # highlight the most frequent reward of demo group 2 in orange
    ax.set_xticks(range(11))  # label the reward ids
    plt.xlabel('Reward ID', fontsize=12)
    plt.ylabel('Recommendation Frequency', fontsize=12)
    plt.title('Distribution of Reward Recommendations by {}'.format(demographic), fontsize=16)
    # create a custom legend with blue/red rectangles for demo group 1 and green/orange rectangles for demo group 2
    blue_patch = mpatches.Patch(color='blue')
    red_patch = mpatches.Patch(color='red')
    green_patch = mpatches.Patch(color='green')
    orange_patch = mpatches.Patch(color='orange')
    plt.legend([(blue_patch, red_patch), (green_patch, orange_patch)], [legend[0], legend[1]],
               numpoints=1, handler_map={tuple: HandlerTuple(ndivide=None, pad=0)})
    plt.show()
# split the member_preds_df into younger/older than mean age
mean_age = member_preds_df['age'].mean()
younger = member_preds_df.loc[member_preds_df['age']<=mean_age,'best_reward'].value_counts().sort_index().values
older = member_preds_df.loc[member_preds_df['age']>mean_age,'best_reward'].value_counts().sort_index().values
age_demo = 'Age (Mean is {:2.2f} years)'.format(mean_age)
# plot the reward frequencies by age
plot_mult_reward_freqs([younger,older],age_demo,['Younger than mean age','Older than mean age'])
# split the member_preds_df into less/greater than mean income
mean_income = member_preds_df['income'].mean()
poorer = member_preds_df.loc[member_preds_df['income']<=mean_income,'best_reward'].value_counts().sort_index().values
richer = member_preds_df.loc[member_preds_df['income']>mean_income,'best_reward'].value_counts().sort_index().values
income_demo = 'Income (Mean is ${:.0f})'.format(mean_income)
# plot the reward frequencies by income
plot_mult_reward_freqs([poorer,richer],income_demo,['Below average income','Above average income'])
# split the member_preds_df into before/after 2017 enrollment
begin2017 = datetime(2017,1,1)
before2017 = member_preds_df.loc[member_preds_df['enrollment_date']<begin2017,'best_reward'].value_counts().sort_index().values
after2017 = member_preds_df.loc[member_preds_df['enrollment_date']>=begin2017,'best_reward'].value_counts().sort_index().values
# plot the reward frequencies by enrollment date
plot_mult_reward_freqs([before2017,after2017],'Enrollment Date',['Before 2017','2017 or after'])
# split the member_preds_df into Female/Male
female = member_preds_df.loc[member_preds_df['gender']=='F','best_reward'].value_counts().sort_index().values
male = member_preds_df.loc[member_preds_df['gender']=='M','best_reward'].value_counts().sort_index().values
# plot the reward frequencies by gender (omitting 'Other')
plot_mult_reward_freqs([female,male],'Gender',['Female','Male'])
Looking at all of the members in the profile dataset, the most commonly recommended reward type was Reward 0, or no reward. That was followed closely by Reward 8 (10-day 25\% discount if you spend \$20), Reward 10 (4-day informational) and Reward 9 (3-day informational). After that, there is a large drop in recommendation frequency for the remaining 7 reward types.
To get a better understanding of how the different demographic features impacted reward recommendations, I also plotted bar charts splitting each demographic into two groups. The first group of bars is plotted in blue with the most common reward recommendation plotted in red, while the second group was plotted in green with the most common reward recommendation plotted in orange. This helped me to quickly understand the differences in the distribution of reward recommendations for each group as well as which reward was most popular for each group.
Age
I split members based on whether they were younger or older than the mean age of 54.39 years. Younger members were recommended Reward 8 the most (>25\%) while older members were recommended Reward 0 (>25\%) the most.
Income
I split members by income the same way as age: based on whether their income was less than or greater than the mean income of \$65,405. Lower income members were recommended Reward 8 (>35\%) the most and higher income members were recommended Reward 0 (~30\%) the most.
Enrollment Date
I split members by enrollment date based on whether or not they were a member before 2017. The trends in member enrollment showed increasing enrollment numbers from the start of the program in 2013 through 2017 before dropping in 2018 (likely due to an incomplete year of enrollments in the data set). Splitting the members into 2013-2016 and 2017-2018 groups created two roughly even groups. The pre-2017 members were recommended Reward 0 (~25\%) the most while members enrolling in 2017 or after were recommended Reward 8 (>25\%) the most.
Gender
Splitting members by gender was easy because it was already a categorical feature. I only included female and male members because they made up the majority of the data set. The 'Other' gender type made up only about 1\% of the data set, so I didn't expect it to have a strong influence on reward predictions. Female members were recommended Reward 0 (~30\%) the most while male members were recommended Reward 8 (~30\%) the most.
Reward 0 and Reward 8 were the most recommended for each demographic distinction. Oddly enough, however, the same reward was not recommended for both groups within a given demographic. This tells me that each demographic did have some impact on the reward recommendations - a good sign for the usefulness of the model.
Looking back at the recommendation frequencies for the entire member data set, one trend that stands out is that three of the top four most recommended reward options (Rewards 0, 9, and 10) do not actually offer any reward. Reward 0 represents no reward offer, while 9 and 10 are simply informational. Reward 8 is the only monetary reward recommended to more than 10\% of the members in the data set, and it happens to have the largest minimum spending amount to receive the reward. This offer provides a 25\% discount when a member spends \$20, meaning that the net amount spent by the member is \$15. I believe these four rewards are recommended the most partly because of how I defined the predicted transaction amounts and the best reward.
For a given member, the make_member_predictions function uses the tuned Bagging regression model to predict the amount the member will spend in a given transaction based on the reward (if any) that is offered. The function returns reward-adjusted transaction amounts, meaning that if a predicted transaction amount exceeds the 'difficulty' (the minimum amount required to receive the reward), then the reward amount is subtracted from the predicted transaction amount. I chose to use reward-adjusted transaction amounts as a way to normalize the reward types. For example, if a member is expected to spend \$4 without a reward offer but would spend \$6 if they received a \$5 BOGO, then it would be better not to offer that member a reward, because the reward amount offsets most of the additional spend. This leads to the question of the reward program's goal and how one defines the best reward. In my opinion, the goal of the reward program is to maximize the amount of money members spend at Starbucks. I assume that reward money is equivalent to transactional money, when in reality it may cost significantly less. Starbucks may only value transactional money and not care how much reward money is given out; or the unit cost of the second drink in a BOGO may be less than the marginal increase in money spent in a single transaction with a BOGO compared to without one. In either of those cases, the definition of the best reward offer would have to be adjusted. Because of the assumptions I have made in calculating predicted transaction amounts, any offer where the reward is equal to the minimum spend amount (i.e. the BOGO offers) is greatly weakened by my adjustment. I believe this partly explains why three of the top four most recommended offers have no reward associated with them (and by extension no adjustment), while the other popular offer in the top four has the largest reward-adjusted minimum spend.
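The reward adjustment described above can be traced through with the \$4 / \$6 BOGO example in a few lines (illustrative values only):

```python
import numpy as np

preds = np.array([4.00, 6.00])     # predicted spend: [no reward, $5 BOGO]
difficulty = np.array([0.0, 5.0])  # minimum spend to complete each offer
reward = np.array([0.0, 5.0])      # reward paid out on completion

met = preds >= difficulty                        # both offers would be completed
adjusted = np.where(met, preds - reward, preds)  # net spend: [4.0, 1.0]
best = int(np.argmax(adjusted))                  # 0 -> no reward maximizes net spend
```

Even though the BOGO raises the raw predicted transaction from \$4 to \$6, the \$5 payout drops its net value to \$1, so the adjustment picks the no-reward option.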
The goal of this project was to make recommendations on which reward to offer a member based on their demographic data (age, income, enrollment date, and gender). To accomplish this goal, I was provided three data sets: information about Starbucks rewards members, information about the types of rewards that were offered, and an event log detailing when rewards were sent to members, when they were viewed and completed by members, and the transaction amounts made by members during a 30-day test period. I cleaned these data sets by encoding categorical features and converting data types from dictionaries, lists, and strings to floats, integers, and datetime objects. I then consolidated the event log data so that I knew which reward type (if any) was active for every transaction. This step proved to be one of the most challenging parts of the project: I had to find ways to compare values from different columns across multiple rows of the event log for a subset of the data corresponding to each member. Due to the size of the dataset (17,000 unique members with over 300,000 rows), I had to be very thoughtful about how to efficiently store and recall the relevant information during the feature extraction process. Once that was done, I could finally begin to train various regression models and test their ability to predict transaction amounts based on reward type and demographic data. However, I quickly discovered that the transactional data had some very extreme outliers that made it difficult to train the models or evaluate their performance. This posed another major challenge in this project - how to handle the outliers in the output variable. After doing some research, I learned about several techniques for handling outliers, ultimately deciding to drop them from the modeling dataset because they did not reflect the types of transactions that I was trying to predict.
Once the outliers were removed, I continued testing various regression models on transactions between \$0 and \$41 before settling on one - tree-based regression with Bagging - to tune using Grid Search. My tuned model had a mean absolute error of \$2.81 on the training set and a mean absolute error of \$3.83 on the testing set, numbers with which I was satisfied. I used my tuned model to make predictions for each member in the original data set and each reward type, to determine the best reward type for each member. I then aggregated the recommended rewards by each of the demographics to better understand the recommendations my model was making. The model most often recommended either no reward or the offer of a 25\% discount on a \$20 transaction, followed by the two informational offers. I addressed the frequency of these reward recommendations in the Results discussion section.
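The tuning step can be sketched as a Grid Search over a Bagging regressor (whose default base estimator is a decision tree). The parameter grid and the synthetic data below are illustrative, not the values used in the project:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Stand-in features (e.g. demographics + reward type) and transaction amounts
rng = np.random.RandomState(42)
X = rng.rand(200, 4)
y = 10 * X[:, 0] + rng.rand(200)

param_grid = {
    'n_estimators': [10, 50],
    'max_samples': [0.5, 1.0],
}
grid = GridSearchCV(BaggingRegressor(random_state=42), param_grid,
                    scoring='neg_mean_absolute_error', cv=3)
grid.fit(X, y)

mae = mean_absolute_error(y, grid.best_estimator_.predict(X))
print(grid.best_params_, round(mae, 3))
```

Scoring with `neg_mean_absolute_error` matches the evaluation metric reported above (MAE in dollars).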
There are two major limitations associated with my modeling approach. First, I assume that each transaction is independent of other transactions. This means that I do not consider the possibility that multiple transactions could occur while an offer is active, which affects both the discount and informational offer types. My model predicts the amount spent in a single transaction, and I use those predictions to make a reward recommendation. In reality, there may be members who make multiple transactions over a short period of time that add up to an amount greater than the amount predicted when no offer is given. I would expect that accounting for multiple transactions per offer would increase the frequency of recommending discount and informational offers.
The second major limitation of my model is that it is trained only on transaction amounts rather than on the number of transactions. By focusing on the amount, I effectively assumed that the likelihood of a transaction occurring was constant for each reward type. A more concrete way to think about this is to imagine a customer standing at a register about to place an order. The customer has already decided to purchase something (make a transaction) and now just has to decide what they want to purchase or how much they want to spend. The type of reward offered can definitely impact their purchasing behavior in this scenario - perhaps they were only planning to buy a small coffee but, because they had a discount reward, decided to buy a pastry as well. However, there is also a significant effect that is not captured by this approach - the members who would not have purchased anything if they had not received an offer. In order to incorporate the probability of a transaction into the reward recommendations, I would need to perform statistical hypothesis tests to determine which reward types show a significant increase in transaction rate compared to the null hypothesis (no reward offer). I would expect this modification to increase the frequency of offers with monetary rewards (BOGO and discount types) over the informational offers or no reward offer.
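One way to run such a test is a one-sided two-proportion z-test comparing the transaction rate with an offer active against the no-offer baseline. A minimal sketch with made-up counts (the real counts would come from the event log):

```python
from scipy.stats import norm

def transaction_rate_ztest(x1, n1, x2, n2):
    """One-sided two-proportion z-test.

    Tests whether group 1 (offer active) has a higher transaction
    rate than group 2 (no offer), using a pooled standard error.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    return z, 1 - norm.cdf(z)  # one-sided p-value

# Illustrative counts: 620 of 1000 offer windows contained a transaction
# vs. 550 of 1000 comparable windows with no offer
z, p = transaction_rate_ztest(620, 1000, 550, 1000)
print(round(z, 2), round(p, 4))
```

Running one such test per reward type (with a multiple-comparisons correction) would identify which offers actually raise the probability of a transaction, not just its size.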